A Computational Theory of the Use-Mention Distinction in Natural Language
To understand the language we use, we sometimes must turn language on itself, and we do this through an understanding of the use-mention distinction. In particular, we are able to recognize mentioned language: that is, tokens (e.g., words, phrases, sentences, letters, symbols, sounds) produced to draw attention to linguistic properties that they possess. Evidence suggests that humans frequently employ the use-mention distinction, and we would be severely handicapped without it; mentioned language frequently occurs for the introduction of new words, attribution of statements, explanation of meaning, and assignment of names. Moreover, just as we benefit from mutual recognition of the use-mention distinction, the potential exists for us to benefit from language technologies that recognize it as well. With a better understanding of the use-mention distinction, applications can be built to extract valuable information from mentioned language, leading to better language learning materials, precise dictionary building tools, and highly adaptive computer dialogue systems.
This dissertation presents the first computational study of how the use-mention distinction occurs in natural language, with a focus on occurrences of mentioned language. Three specific contributions are made. The first is a framework for identifying and analyzing instances of mentioned language, in an effort to reconcile elements of previous theoretical work for practical use. Definitions for mentioned language, metalanguage, and quotation have been formulated, and a procedural rubric has been constructed for labeling instances of mentioned language. The second is a sequence of three labeled corpora of mentioned language, containing delineated instances of the phenomenon. The corpora illustrate the variety of mentioned language, and they enable analysis of how the phenomenon relates to sentence structure. Using these corpora, inter-annotator agreement studies have quantified the concurrence of human readers in labeling the phenomenon. The third contribution is a method for identifying common forms of mentioned language in text, using patterns in metalanguage and sentence structure. Although the full breadth of the phenomenon is likely to elude computational tools for the foreseeable future, some specific, common rules for detecting and delineating mentioned language have been shown to perform well.
Automated Ableism: An Exploration of Explicit Disability Biases in Sentiment and Toxicity Analysis Models
We analyze sentiment analysis and toxicity detection models to detect the
presence of explicit bias against people with disability (PWD). We employ the
bias identification framework of Perturbation Sensitivity Analysis to examine
conversations related to PWD on social media platforms, specifically Twitter
and Reddit, in order to gain insight into how disability bias is disseminated
in real-world social settings. We then create the Bias Identification Test in
Sentiment (BITS) corpus to quantify explicit disability bias in any sentiment
analysis or toxicity detection model. Our study uses BITS to uncover
significant biases in four open AIaaS (AI as a Service) sentiment analysis
tools (TextBlob, VADER, Google Cloud Natural Language API, and DistilBERT) and
two toxicity detection models (two versions of Toxic-BERT). Our findings
indicate that all of these models exhibit statistically significant explicit
bias against PWD.
Comment: TrustNLP at ACL 202
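The perturbation-sensitivity setup described above can be sketched as follows. This is a minimal illustration only: the toy lexicon scorer stands in for the actual tools (TextBlob, VADER, etc.), and the template sentence, identity terms, and lexicon weights are invented examples, not items from the BITS corpus.

```python
# Sketch of a perturbation-sensitivity check for explicit bias: score the
# same template sentence with different identity terms filled in, then
# compare the resulting sentiment scores.

# Toy lexicon-based scorer standing in for a real tool such as VADER.
LEXICON = {"great": 1.0, "terrible": -1.0, "blind": -0.5}  # hypothetical weights

def toy_sentiment(text: str) -> float:
    words = text.lower().replace(".", "").split()
    return sum(LEXICON.get(w, 0.0) for w in words)

TEMPLATE = "My neighbor, a {} person, told a great story."
IDENTITY_TERMS = ["sighted", "blind", "deaf"]  # perturbation set

scores = {term: toy_sentiment(TEMPLATE.format(term)) for term in IDENTITY_TERMS}
# An unbiased scorer assigns the same sentiment to semantically equivalent
# sentences; the spread across perturbations quantifies explicit bias.
bias_spread = max(scores.values()) - min(scores.values())
print(scores, bias_spread)
```

Here the identity term alone shifts the score, which is the signature of explicit bias that a perturbation-sensitivity analysis is designed to surface.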
This Table is Different: A WordNet-Based Approach to Identifying References to Document Entities
Writing intended to inform frequently contains references to document entities (DEs), a mixed class that includes orthographically structured items (e.g., illustrations, sections, lists) and discourse entities (arguments, suggestions, points). Such references are vital to the interpretation of documents, but they often eschew identifiers such as "Figure 1" for inexplicit phrases like "in this figure" or "from these premises". We examine inexplicit references to DEs, termed DE references, and recast the problem of their automatic detection into the determination of relevant word senses. We then show the feasibility of machine learning for the detection of DE-relevant word senses, using a corpus of human-labeled synsets from WordNet. We test cross-domain performance by gathering lemmas and synsets from three corpora: website privacy policies, Wikipedia articles, and Wikibooks textbooks. Identifying DE references will enable language technologies to use the information encoded by them, permitting the automatic generation of finely-tuned descriptions of DEs and the presentation of richly-structured information to readers.
Effects of Online Self-Disclosure on Social Feedback During the COVID-19 Pandemic
We investigate relationships between online self-disclosure and received
social feedback during the COVID-19 crisis. We crawl a total of 2,399 posts and
29,851 associated comments from the r/COVID19_support subreddit and manually
extract fine-grained personal information categories and types of social
support sought from each post. We develop a BERT-based ensemble classifier to
automatically identify types of support offered in users' comments. We then
analyze the effect of personal information sharing and posts' topical, lexical,
and sentiment markers on the acquisition of support and five interaction
measures (submission scores, the number of comments, the number of unique
commenters, the length and sentiments of comments). Our findings show that: 1)
users were more likely to share their age, education, and location information
when seeking both informational and emotional support, as opposed to pursuing
either one; 2) while personal information sharing was positively correlated
with receiving informational support when requested, it did not correlate with
emotional support; 3) as the degree of self-disclosure increased, information
support seekers obtained higher submission scores and longer comments, whereas
emotional support seekers' self-disclosure resulted in lower submission scores,
fewer comments, and fewer unique commenters; 4) post characteristics affecting
social feedback differed significantly based on types of support sought by post
authors. These results provide empirical evidence for the varying effects of
self-disclosure on acquiring desired support and user involvement online during
the COVID-19 pandemic. Furthermore, this work can assist support seekers hoping
to enhance and prioritize specific types of social feedback.
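The ensemble step described above can be sketched as majority voting over several classifiers' labels. The code below is a minimal sketch: the individual model outputs are invented placeholders standing in for the predictions of fine-tuned BERT models, and the label set is reduced to the two support types named in the abstract.

```python
from collections import Counter

# Sketch of majority-vote ensembling of per-model predictions for a single
# comment. The model outputs here are hypothetical stand-ins for the
# predictions of several fine-tuned BERT classifiers.
def ensemble_vote(predictions: list[str]) -> str:
    # The most common label wins; Counter.most_common breaks ties by
    # first-seen order among the predictions.
    return Counter(predictions).most_common(1)[0][0]

model_outputs = ["informational", "emotional", "informational"]
label = ensemble_vote(model_outputs)
print(label)  # "informational"
```

Majority voting is one common way to combine classifiers in an ensemble; the paper's actual aggregation scheme may differ.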
Survey on Sociodemographic Bias in Natural Language Processing
Deep neural networks often learn unintended biases during training, which
might have harmful effects when deployed in real-world settings. This paper
surveys 209 papers on bias in NLP models, most of which address
sociodemographic bias. To better understand the distinction between bias and
real-world harm, we turn to ideas from psychology and behavioral economics to
propose a definition for sociodemographic bias. We identify three main
categories of NLP bias research: types of bias, quantifying bias, and
debiasing. We conclude that current approaches to quantifying bias face
reliability issues, that many bias metrics do not relate to real-world biases,
and that current debiasing techniques are superficial, hiding bias rather than
removing it. Finally, we provide recommendations for future work.
Comment: 23 pages, 1 figure
Unmasking Nationality Bias: A Study of Human Perception of Nationalities in AI-Generated Articles
We investigate the potential for nationality biases in natural language
processing (NLP) models using human evaluation methods. Biased NLP models can
perpetuate stereotypes and lead to algorithmic discrimination, posing a
significant challenge to the fairness and justice of AI systems. Our study
employs a two-step mixed-methods approach that includes both quantitative and
qualitative analysis to identify and understand the impact of nationality bias
in a text generation model. Through our human-centered quantitative analysis,
we measure the extent of nationality bias in articles generated by AI sources.
We then conduct open-ended interviews with participants, performing qualitative
coding and thematic analysis to understand the implications of these biases on
human readers. Our findings reveal that biased NLP models tend to replicate and
amplify existing societal biases, which can translate to harm if used in a
sociotechnical setting. The qualitative analysis from our interviews offers
insights into the experience readers have when encountering such articles,
highlighting the potential to shift a reader's perception of a country. These
findings emphasize the critical role of public perception in shaping AI's
impact on society and the need to correct biases in AI systems.
Nationality Bias in Text Generation
Little attention has been paid to analyzing nationality bias in language
models, even though nationality is widely used as a feature to improve the
performance of social NLP models. This paper examines how a text generation
model, GPT-2, accentuates pre-existing societal biases about country-based
demonyms. We generate stories using GPT-2 for various nationalities and use
sensitivity analysis to explore how the number of internet users and the
country's economic status impacts the sentiment of the stories. To reduce the
propagation of biases through large language models (LLM), we explore the
debiasing method of adversarial triggering. Our results show that GPT-2
demonstrates significant bias against countries with fewer internet users, and
that adversarial triggering effectively reduces this bias.
Comment: Paper accepted in the 17th Conference of the European Chapter of the
Association for Computational Linguistics (EACL 2023)
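The generate-and-score loop described above can be sketched as follows. Everything in this sketch is a placeholder: `generate_story` stands in for a GPT-2 generation call, `score_sentiment` for a sentiment model, and the prompt template and demonym list are illustrative assumptions, not the paper's actual materials.

```python
# Sketch of the experimental loop: generate a story for each demonym-filled
# prompt and record the sentiment of the generation. Both helper functions
# are placeholders, not the paper's code.

def generate_story(prompt: str) -> str:
    # Placeholder: a real implementation would call GPT-2, e.g. via the
    # Hugging Face transformers library's text-generation pipeline.
    return prompt + " ..."

def score_sentiment(text: str) -> float:
    # Placeholder: a real implementation would return a score in [-1, 1]
    # from a sentiment analysis model.
    return 0.0

DEMONYMS = ["French", "Kenyan", "Peruvian"]  # illustrative subset
PROMPT_TEMPLATE = "The {} people are"  # hypothetical prompt shape

results = {}
for demonym in DEMONYMS:
    story = generate_story(PROMPT_TEMPLATE.format(demonym))
    results[demonym] = score_sentiment(story)
# Downstream, these per-nationality scores would be analyzed against
# covariates such as internet-user counts and economic status.
print(results)
```

The sensitivity analysis in the abstract then asks how these per-nationality sentiment scores vary with country-level covariates.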
Understanding How to Inform Blind and Low-Vision Users about Data Privacy through Privacy Question Answering Assistants
Understanding and managing data privacy in the digital world can be
challenging for sighted users, let alone blind and low-vision (BLV) users.
There is limited research on how BLV users, who have special accessibility
needs, navigate data privacy, and how potential privacy tools could assist
them. We conducted an in-depth qualitative study with 21 US BLV participants to
understand their data privacy risk perception and mitigation, as well as their
information behaviors related to data privacy. We also explored BLV users'
attitudes towards potential privacy question answering (Q&A) assistants that
enable them to better navigate data privacy information. We found that BLV
users face heightened security and privacy risks, but their risk mitigation is
often insufficient. They do not necessarily seek data privacy information but
clearly recognize the benefits of a potential privacy Q&A assistant. They also
expect privacy Q&A assistants to possess cross-platform compatibility, support
multi-modality, and demonstrate robust functionality. Our study sheds light on
BLV users' expectations when it comes to usability, accessibility, trust, and
equity issues regarding digital data privacy.
Comment: This research paper is accepted by USENIX Security '2